Abstract. This paper initiates formal analysis of a simple, distributed 
algorithm for community detection on networks. We analyze an algo- 
rithm that we call Max-LPA, both in terms of its convergence time and 
in terms of the "quality" of the communities detected. Max-LPA is an 
instance of a class of community detection algorithms called label propa- 
gation algorithms. As far as we know, most analysis of label propagation 
algorithms thus far has been empirical in nature and in this paper we 
seek a theoretical understanding of label propagation algorithms. In our 
main result, we define a clustered version of Erdos-Renyi random graphs 
with clusters $V_1, V_2, \ldots, V_k$ where the probability $p$ of an edge connecting 
nodes within a cluster $V_i$ is higher than $p'$, the probability of an edge 
connecting nodes in distinct clusters. We show that even with fairly general 
restrictions on $p$ and $p'$ ($p = \Omega(1/n^{1/4-\epsilon})$ for any $\epsilon > 0$, $p' = O(p^2)$, where 
$n$ is the number of nodes), Max-LPA detects the clusters $V_1, V_2, \ldots, V_k$ 
in just two rounds. Based on this and on empirical results, we conjecture 
that Max-LPA can correctly and quickly identify communities on 
clustered Erdos-Renyi graphs even when the clusters are much sparser, i.e., 
with $p = c \log n / n$ for some $c > 1$. 



1 Introduction 

The problem of efficiently analyzing large social networks spans several areas in 
computer science. One of the key properties of social networks is their community 
structure. A community in a network is a group of nodes that are "similar" to 
each other and "dissimilar" from the rest of the network. There has been a lot of 
work recently on defining, detecting, and identifying communities in real-world 
networks [Sl[7l[21]. It is usually, but not always, the tendency for vertices to be 
gathered into distinct groups, or communities, such that edges between vertices 
in the same community are dense but inter-community edges are sparse [20, 9]. A 
community detection algorithm takes as input a network and outputs a partition 
of the vertex set into "communities". Detecting communities can allow us to 
understand attributes of vertices from the network topology alone. 

There are many metrics to measure the "quality" of the communities detected 
by a community detection algorithm. A popular and widely adopted metric is 
graph modularity defined by Newman [21]. This measure is obtained by summing 
up, over all communities of a given partition, the difference between the observed 
fraction of links inside the community and the expected value of this quantity 
for a null model, that is, a random network having the same size and same 
degree sequence. Other popular measures include graph conductance [12] and 
edge betweenness [8]. 

The community detection problem has connections to the graph partitioning 
problem which has been well studied since 1970s [SJ [T31 HH [55]. Graph parti- 
tioning problems are usually modeled as combinatorial optimization problems 
and this approach requires a precise sense of the objective function being opti- 
mized. Sometimes additional criteria such as the number of parts or the sizes 
of parts also need to be specified. In contrast, the notion of communities is rel- 
atively "fuzzy" [5] and changes from application to application. Furthermore, 
researchers in social network analysis are reluctant to over-specify properties 
of communities and would rather let algorithms "discover" communities in the 
given network. For a survey of the different approaches that have been proposed 
to find community structure in networks, see Fortunato's work 0. 

The focus of this paper is a class of seemingly simple community detection 
algorithms called label propagation algorithms (LPA). Raghavan et al. [24] seem 
to be the first to study label propagation algorithms for detecting network 
communities. The advantage of an LPA is, in addition to its simplicity, the fact that 
it can be easily parallelized or distributed. The generic LPA works as follows: 
initially each node in the network is assigned a unique label. In each iteration 
every node updates its label to the label which is the most frequent in its neigh- 
borhood; ties are broken randomly. One obtains variants of LPA by varying how 
the initial label assignment is made, how ties are broken, and whether a node 
includes itself in computing the most frequent label in its neighborhood. In this 
paper, we analyze a specific instance of LPA called Max-LPA in which nodes 
are assigned initial labels uniformly at random from some large enough space. 
Also, if there is a tie, it is broken in favor of the larger label. Finally, a node 
includes its own label in determining the most frequent label in its neighborhood. 

At any point during the execution of an LPA, a community is simply all 
nodes with the same label. The intuition behind using an LPA for community 
detection is that a single label (the maximum label in the case of Max-LPA) 
can quickly become the most frequent label in neighborhoods within a dense 
cluster whereas labels have trouble "traveling" across a sparse set of edges that 
might connect two dense clusters. An LPA is said to have converged if it starts 
cycling through a collection of states. Ideally, we would like LPA to converge 
to a cycle of period one, i.e., to a state in which any further execution of LPA 
yields the same state. However, this is not always possible. In fact, part of the 



difficulty of analyzing LPA stems from the randomized tie-breaking rule. This 
way of breaking ties makes it difficult to estimate the period of the cycle that the 
algorithm eventually converges to. The version of LPA that we analyze, namely 
Max-LPA, does not suffer from this problem because Poljak and Sura [23] have 
shown in a different context that Max-LPA converges to a cycle of period 1 
or 2. 

Despite the simplicity of LPA, there has been very little formal analysis of 
either the convergence time of LPA or the quality of communities produced 
by it. There have been papers [241 1151 3] that provide some empirical results 
about LPAs. For example, the number of iterations of label updates required 
for the correct convergence of LPA is around 5 [24], but it is hard to derive 
any fundamental conclusions about LPA's behavior, even on specific families 
of networks, from these empirical results. One reason for this state of affairs is 
that despite its simplicity, even on simple networks, LPA can have complicated 
behavior, not unlike epidemic processes that model the spread of disease in a 
networked population pJ3J. Our goal in this paper is to initiate a systematic 
analysis of the behavior of Max-LPA, both in terms of its convergence time 
and in terms of the "quality" of communities produced. 

Watts and Strogatz [27] have pointed out that the classical Erdos-Renyi 
model of random graphs differs from real-world social, technological, and 
biological networks in several critical ways. Following this, a variety of other random 
graph models have been considered as models of real-world networks. These 
include the configuration model [17, 2], the Watts-Strogatz model [27], preferential 
attachment models [1], etc. (for definitions and more examples, see [18]). There 
is no empirical study or formal analysis of LPAs on these classes of networks. As 
our first step towards developing analysis techniques for LPAs we define a clus- 
tered version of Erdos-Renyi random graphs and present a formal proof of the 
running times of LPAs on these networks. We realize that Erdos-Renyi networks 
and even clustered Erdos-Renyi networks are inadequate models of real world 
networks, but believe that our analysis techniques could be useful in general. 

The variants of LPA can naturally be viewed as distributed algorithms, mean- 
ing each node only has local knowledge, i.e., knowledge of its label and the labels 
of its neighbors obtained by means of message passing along edges of the net- 
works. Distributed algorithms are generally classified as synchronous or asyn- 
chronous algorithms. (The reader is referred to standard books (e.g., [22]) for a 
full exposition of these terms.) Here we analyze a synchronous version of Max- 
LPA. The algorithm proceeds in rounds and in each round each node sends its 
label to all neighbors and then updates its label based on the labels received 
from neighbors and its own label. 

1.1 Preliminaries 

We use $G = (V, E)$ to denote an undirected connected graph (network) of size 
$n = |V|$. For $v \in V$, we denote by $N(v) = \{u : u \in V, (u, v) \in E\}$ the 
neighborhood of $v$ in graph $G$, by $\deg(v) = |N(v)|$ the degree of $v$, and by 
$\Delta(G) = \max_{v \in V} \deg(v)$ the maximum degree over all the vertices in $G$. A $k$-hop 



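Algorithm 1 Max-LPA on a node $v$ 

    $i \leftarrow 0$ 
    $\ell_v[i] \leftarrow$ random(0, 1) 
    while true do 
        $i \leftarrow i + 1$ 
        send $\ell_v[i-1]$ to all $u \in N(v)$ 
        receive $\ell_u[i-1]$ from all $u \in N(v)$ 
        $\ell_v[i] \leftarrow \max \{ \ell \mid \sum_{u \in N'(v)} [\ell_u[i-1] == \ell] \ge \sum_{u \in N'(v)} [\ell_u[i-1] == \ell'] \text{ for all } \ell' \}$ 
    end while 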



neighborhood ($k \ge 1$) of $v$ is defined as $N_k(v) = \{w : \mathrm{dist}_G(w, v) \le k\} \setminus \{v\}$. We 
denote the closed neighborhood (respectively, closed $k$-hop neighborhood) of $v$ as 
$N'(v) = N(v) \cup \{v\}$ (respectively, $N'_k(v) = N_k(v) \cup \{v\}$). 

Denote by $\ell_u(t)$ the label of node $u$ just before round $t$. When the round 
number is clear from the context, we use $\ell_u$ to denote the current label of $u$. 
Since the number of labels in the network is finite, LPA will behave periodically 
starting in some round $t^*$, i.e., for some $p \ge 1$, $0 \le i < p$, and $j = 0, 1, 2, \ldots$, 

$\ell_u(t^* + i) = \ell_u(t^* + i + j \cdot p)$ 

for all $u \in V$. Then we say that Max-LPA has converged in $t^*$ rounds. 

We now describe Max-LPA precisely (see Algorithm 1). Every node $v \in V$ 
is assigned a unique label uniformly and independently at random. For 
concreteness, we assume that these labels come from the range [0, 1]. At the start of a 
round, each node sends its label to all neighboring nodes. After receiving labels 
from all neighbors, a node $v$ updates its label as: 

$\ell_v \leftarrow \max \{ \ell \mid \sum_{u \in N'(v)} [\ell_u == \ell] \ge \sum_{u \in N'(v)} [\ell_u == \ell'] \text{ for all } \ell' \}$, (1) 

where $[\ell_u == \ell]$ evaluates to 1 if $\ell_u = \ell$, and to 0 otherwise. Note that 
there is no randomness in the algorithm after the initial assignments of labels. 
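
The update rule (1) can be sketched in a few lines of Python. This is a minimal, centralized simulation of the synchronous rounds, not the authors' implementation: the function names `max_lpa_round` and `run_max_lpa`, the adjacency-dictionary representation, and the `max_rounds` cap are all assumptions made for illustration.

```python
from collections import Counter


def max_lpa_round(adj, labels):
    """Apply update rule (1) to every node simultaneously (one round)."""
    new_labels = {}
    for v, neighbors in adj.items():
        # Count labels over the closed neighborhood N'(v) = N(v) + {v}.
        counts = Counter(labels[u] for u in neighbors)
        counts[labels[v]] += 1
        best = max(counts.values())
        # Among the most frequent labels, ties go to the largest label.
        new_labels[v] = max(l for l, c in counts.items() if c == best)
    return new_labels


def run_max_lpa(adj, labels, max_rounds=10_000):
    """Run until the state repeats with period 1 or 2.

    Returns (state, t_star, period), where t_star is the first round from
    which the state sequence is periodic.
    """
    prev = None
    for t in range(max_rounds):
        nxt = max_lpa_round(adj, labels)
        if nxt == labels:          # fixed point: period 1
            return labels, t, 1
        if nxt == prev:            # alternates between two states: period 2
            return prev, t - 1, 2
        prev, labels = labels, nxt
    raise RuntimeError("no period-1 or period-2 cycle within max_rounds")
```

For example, on the 3-node path `{0: [1], 1: [0, 2], 2: [1]}` with initial labels 0.1, 0.9, 0.5, every node adopts 0.9 in the first round and the state is then fixed. On a 4-cycle whose labels alternate between two values, the two label classes swap in every round, which illustrates why a period-2 cycle has to be detected separately.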

By "w.h.p." (with high probability) we mean with probability at least $1 - 1/n^c$ 
for some constant $c \ge 1$. In this paper we repeatedly use the following versions of 
a tail bound on the probability distribution of a random variable, due to 
Chernoff and Hoeffding [3], [11]. Let $X_1, X_2, \ldots, X_m$ be independent and identically 
distributed binary random variables. Let $X = \sum_{i=1}^{m} X_i$. Then, for any $0 \le \epsilon \le 1$ 
and $c \ge 1$, 

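$\Pr[X > (1 + \epsilon) \cdot E[X]] \le \exp(-\epsilon^2 E[X] / 3)$ (2) 

$\Pr[X < (1 - \epsilon) \cdot E[X]] \le \exp(-\epsilon^2 E[X] / 2)$ (3) 

$\Pr[|X - E[X]| > \sqrt{3c \cdot E[X] \cdot \log n}] \le 2/n^c$ (4) 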
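
As a sanity check, a deviation bound of the additive form (4) can be derived from the two multiplicative bounds. The sketch below assumes the standard Chernoff exponents $\epsilon^2 E[X]/3$ (upper tail) and $\epsilon^2 E[X]/2$ (lower tail), natural logarithms, and $E[X] \ge 3c \log n$ so that the chosen $\epsilon$ is at most 1; the constant 2 on the right-hand side comes from adding the two tails.

```latex
% Choose epsilon so that epsilon * E[X] equals the additive deviation in (4).
\epsilon = \sqrt{\frac{3c \log n}{E[X]}}
\quad\Longrightarrow\quad
\epsilon \cdot E[X] = \sqrt{3c \cdot E[X] \cdot \log n}

% Upper tail, using the exponent epsilon^2 E[X] / 3:
\Pr[X > (1+\epsilon) E[X]]
  \le \exp\left(-\frac{\epsilon^2 E[X]}{3}\right)
  = \exp(-c \log n) = n^{-c}

% Lower tail, using the exponent epsilon^2 E[X] / 2:
\Pr[X < (1-\epsilon) E[X]]
  \le \exp\left(-\frac{\epsilon^2 E[X]}{2}\right)
  = n^{-3c/2} \le n^{-c}

% Union of the two tails:
\Pr\left[\,|X - E[X]| > \sqrt{3c \cdot E[X] \cdot \log n}\,\right] \le 2 n^{-c}
```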



1.2 Results 

As mentioned earlier, the purpose of this paper is to counterbalance the pre- 
dominantly empirical line of research on LPA and initiate a systematic analysis 
of Max-LPA. Our main results can be summarized as follows: 

— As a "warm-up" we prove (Section 2) that when executed on an $n$-node 
path Max-LPA converges to a cycle of period one in $\Theta(\log n)$ rounds w.h.p. 
Moreover, we show that w.h.p. the state that Max-LPA converges to has 
$\Omega(n)$ communities. 

— In our main result (Section 3) we define a class of random graphs that we 
call clustered Erdos-Renyi graphs. A clustered Erdos-Renyi graph $G = (V, E)$ 
comes with a node partition $\Pi = (V_1, V_2, \ldots, V_k)$ and pairs of nodes in each 
$V_i$ are connected with probability $p_i$ and pairs of nodes in distinct parts in 
$\Pi$ are connected with probability $p' < \min_i\{p_i\}$. Since $p'$ is small relative 
to any of the $p_i$'s, one might view a clustered Erdos-Renyi graph as having 
a natural community structure given by $\Pi$. We prove that even with fairly 
general restrictions on the $p_i$'s and $p'$ and on the sizes of the $V_i$'s, 
Max-LPA converges to a period-1 cycle in just 2 rounds, w.h.p., and "correctly" 
identifies $\Pi$ as the community structure of $G$. 

— Roughly speaking, the above result requires each $p_i$ to be $\Omega((\log n / n)^{1/4})$. 
We believe that Max-LPA would correctly and quickly identify $\Pi$ as the 
community structure of a given clustered Erdos-Renyi graph even when the 
$p_i$'s are much smaller, e.g., even when $p_i = c \log n / n$ for $c > 1$. However, at 
this point our analysis techniques do not seem adequate for situations with 
smaller $p_i$ values and so we provide empirical evidence (Section 4) for our 
conjecture that Max-LPA correctly converges to $\Pi$ in $O(\mathrm{polylog}(n))$ rounds 
even when $p_i = c \log n / n$ for some $c > 1$ and $p'$ is just a logarithmic factor 
smaller than $p_i$. 

1.3 Related Work 

There are several variants of LPA presented in the literature [U [TOl [261 US]- 
Most of these are concerned about "quality" of the output and present empirical 
studies of output produced by LPA. 

Raghavan et al. [24], based on experiments, claimed that at least 95% of 
the nodes are classified correctly by the end of 5 rounds of label updates. But 
the experiments that they carried out were on small networks. 

Cordasco and Gargano [1] proposed a semi-synchronous approach which is 
guaranteed to converge without oscillations and can be parallelized. They 
provided a formal proof of convergence but did not bound the running time of the 
algorithm. Liu and Murata [16] presented a variation of LPA for bipartite networks 
which converges, but no formal proof is provided, either for the convergence or 
for the running time. 

Leung et al. [15] presented an empirical analysis of the quality of output produced 
by LPA on larger data sets. From experimental results on a specially structured 
network they claimed that the running time of LPA is $O(\log n)$. 




2 Analysis of Max-LPA on a Path 



Consider a path $P_n$ consisting of vertices $V = [n]$ and edge set $E = \{(i, i+1) \mid 
1 \le i < n\}$. In this section, we analyze the execution of Max-LPA on a path 
network $P_n$ and prove that in $O(\log n)$ rounds Max-LPA converges to a state 
from which no further label updates occur and furthermore in such a state the 
number of communities is $\Omega(n)$ w.h.p. 

Lemma 1. When Max-LPA is executed on path network $P_n$, independent of 
the initial label assignment, it will converge to a state from which no further label 
updates occur. 

Proof. First we show that at any point in the execution of Max-LPA, the 
subgraph of $P_n$ induced by all nodes with the same label is a single connected 
component. This is true before the first round since the initial label assignment 
assigns distinct labels to the nodes. Suppose the claim is true just before round 
$t$. Let $S = \{i, i+1, \ldots, j\}$ be the set of nodes of $P_n$ with label 
$\ell$, just before round $t$ of Max-LPA. 

— If $S$ contains two or more nodes then none of the nodes in $S$ will ever change 
their label, because each node in $S$ sees label $\ell$ at least twice among the at 
most three labels in its closed neighborhood. Moreover, the only other nodes 
that can acquire label $\ell$ in round $t$ are nodes $i - 1$ and $j + 1$. Hence, after 
round $t$, the set of nodes with label $\ell$ still induces a single connected component. 

— If $S$ contains a single node, say $i$, then the only way in which label $\ell$ might 
induce multiple connected components after round $t$ would be if in round $t$: 
(i) node $i - 1$ acquires label $\ell$, (ii) node $i + 1$ acquires label $\ell$, and (iii) node 
$i$ changes its label to some $\ell' \ne \ell$. (i) and (ii) above can only happen if $\ell$ is 
larger than the labels of nodes $i - 1$ and $i + 1$ just before round $t$. But, if 
this were true, then node $i$ would not change its label in round $t$. 

Hence, in either case the nodes with label $\ell$ would induce a connected component. 

According to Poljak and Sura [23], Max-LPA has a period of 1 or 2 on any 
network with any initial label assignment. To obtain a contradiction we suppose 
that Max-LPA has a period of 2 when executed on $P_n$ for some $n$ and some 
initial label assignment. Therefore for some $v \in V$ and some time $t$, $\ell_v(t + 2i) = \ell$ 
and $\ell_v(t + 2i + 1) = \ell'$ for $\ell \ne \ell'$ and all $i = 0, 1, 2, \ldots$. For $v$ to change its label 
from $\ell$ to $\ell'$ in a round it must be the case that $\ell < \ell'$. This is because $v$ 
cannot have two neighbors with label $\ell'$ since $\ell'$ can only induce one connected 
component. Hence, $v$ acquires the new label $\ell'$ by tie breaking. By a symmetric 
argument, for $v$ to change its label from $\ell'$ to $\ell$ in the next round, it must be the 
case that $\ell' < \ell$. Both conditions cannot be met and we have a contradiction. □ 

Definition 1. A node $v$ is said to be a $k$-hop maxima if its label $\ell_v$ is (strictly) 
greater than the labels of all nodes in its $k$-hop neighborhood. As a short form, we 
will use local maxima to refer to any node that is a 1-hop maxima. 

Let $M = \{i_1, i_2, \ldots, i_r\}$, $i_1 < i_2 < \cdots < i_r$, be the set of nodes which are 2-hop 
maxima in $P_n$ for the given initial label assignment. For any $1 \le j < r$, nodes 
$i_j$ and $i_{j+1}$ are said to be consecutive 2-hop maxima. 



Lemma 2. When Max-LPA converges, the number of communities it identifies 
is bounded below by the number of 2-hop maxima in the initial label assignment. 

Proof. Since all initial node labels are assumed to be distinct, in the first round 
of Max-LPA every node $u \in V$ acquires a label by breaking ties. Since ties are 
broken in favor of larger labels, all neighbors of each $i_j \in M$ will acquire the 
corresponding 2-hop maxima label $\ell_{i_j}$. Thus after one round of Max-LPA, for 
each $i_j \in M$, there are three consecutive nodes in $P_n$ with label $\ell_{i_j}$. None of 
these nodes will change their label in future rounds and hence there will be a 
community induced by label $\ell_{i_j}$ when Max-LPA converges. □ 
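
Lemma 2 is easy to check empirically on a random path. The following is a small sketch, assuming nodes are numbered 0 to n-1 and labels are stored in a list; the helper names `lpa_round_on_path`, `two_hop_maxima`, and `run_on_path`, as well as the round cap, are hypothetical choices for this illustration.

```python
import random


def lpa_round_on_path(labels):
    """One synchronous Max-LPA round on the path 0 - 1 - ... - (n-1)."""
    new = []
    for v in range(len(labels)):
        window = labels[max(0, v - 1):v + 2]   # closed neighborhood of v
        best = max(window.count(x) for x in window)
        # Most frequent label in the window, ties broken by the larger label.
        new.append(max(x for x in window if window.count(x) == best))
    return new


def two_hop_maxima(labels):
    """Nodes whose label strictly exceeds every label within distance 2."""
    n = len(labels)
    return [v for v in range(n)
            if all(labels[v] > labels[u]
                   for u in range(max(0, v - 2), min(n, v + 3)) if u != v)]


def run_on_path(labels, max_rounds=None):
    """Iterate to a fixed point; return (final labels, rounds used)."""
    max_rounds = max_rounds or 4 * len(labels) + 10
    for t in range(max_rounds):
        nxt = lpa_round_on_path(labels)
        if nxt == labels:
            return labels, t
        labels = nxt
    raise RuntimeError("no fixed point within max_rounds")


if __name__ == "__main__":
    rng = random.Random(1)
    initial = [rng.random() for _ in range(200)]
    final, rounds = run_on_path(initial)
    print(len(two_hop_maxima(initial)), "2-hop maxima,",
          len(set(final)), "communities after", rounds, "rounds")
```

In every run the number of distinct final labels should be at least the number of 2-hop maxima of the initial assignment, and each 2-hop maxima should still carry its own initial label at the end.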

Lemma 3. Let $D$ be the maximum distance in $P_n$ between a pair of consecutive 
nodes in $M$. Then the number of rounds that Max-LPA takes to converge is 
bounded above by $D + 2$. 

Proof. Call a node $v$ isolated if its label is distinct from the labels of its neighbors. 
After the first round of Max-LPA each node $i_j \in M$ and its neighbors acquire 
label $\ell_{i_j}$. Therefore, after the first round, every connected component of the 
graph induced by isolated nodes has size bounded above by $D$. We now show 
that in each subsequent round, the size of every connected component of size two 
or more will reduce by at least one. Let $S$ be a component in the graph induced 
by isolated nodes, just before round $t$. Let $i$ be the node with maximum label in 
$S$. Since $S$ contains at least two nodes, without loss of generality suppose that 
$i + 1$ is also in $S$. In round $t$, node $i$ could acquire the label of a node outside $S$. 
If this happens $i$ would cease to be isolated after round $t$. Similarly, in round $t$, 
node $i + 1$ could acquire the label of a node outside $S$ and would therefore cease 
to be isolated after round $t$. If neither of these happens in round $t$, then node 
$i + 1$ will acquire the label of node $i$ in round $t$ and node $i$ will not change its 
label. In this case, both $i$ and $i + 1$ will cease to be isolated nodes after round 
$t$. In any case, we see that the size of the component $S$ has shrunk by at least 
one in round $t$. Thus in $D + 1$ rounds we will reach a state in which all 
components in the graph induced by isolated nodes have size one. Isolated nodes 
whose labels are larger than the labels of neighbors will make no further label 
updates. The remaining isolated nodes will disappear in one more round. □ 

Theorem 1. When Max-LPA is executed on a path $P_n$, it converges to a state 
from which no further label updates occur in $O(\log n)$ rounds w.h.p. Furthermore, 
in such a state there are $\Omega(n)$ communities. 

Proof. Partition $P_n$ into "segments" of 5 nodes each. Let $S$ denote the set of 
center nodes of these segments. The probability that a node in $P_n$ is a 2-hop 
maxima is 1/5. Therefore the expected number of nodes in $S$ that end up being 
2-hop maxima is $n/25$. Now note that for any two nodes $i, j \in S$, node $i$ being 
a 2-hop maxima is independent of node $j$ being a 2-hop maxima due to the fact 
that there are at least 4 nodes between $i$ and $j$. Therefore, we can apply the 
lower tail Chernoff bound (3) to conclude that w.h.p. at least $n/50$ nodes in $P_n$ 
are 2-hop maxima. Combining this with Lemma 2 tells us that when Max-LPA 
converges, it does so to a state in which there are at least $n/50$ communities 
with high probability. 

Now consider a contiguous sequence of $k$ 5-node segments. The probability 
that none of the centers of the $k$ segments are 2-hop maxima is $(4/5)^k$. Note 
that here we use the independence of different segment centers becoming 2-hop 
maxima. Hence, for a large enough constant $c$, the probability that none of the 
centers of $k = c \log n$ consecutive segments are 2-hop maxima is at most $1/n^2$. 
Using the union bound and observing that there are at most $n$ consecutive segment 
sequences of length $k$, we see that the probability that there is a sequence of 
$k = c \log n$ consecutive segments, none of whose centers are 2-hop maxima, is 
at most $1/n$. Therefore, with probability at least $1 - 1/n$ every sequence of 
$k = c \log n$ consecutive segments contains a segment whose center is a 2-hop 
maxima. It follows that the distance between consecutive 2-hop maxima is at 
most $5c \log n$ with probability at least $1 - 1/n$. The result follows by combining 
this with Lemma 3. □ 
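
The arithmetic behind the proof can be spelled out as follows. This is a worked sketch rather than part of the original argument; it assumes natural logarithms, an interior segment center (so that its closed 2-hop neighborhood has exactly 5 nodes), and the standard lower-tail Chernoff exponent $\epsilon^2 E[X]/2$ with $\epsilon = 1/2$.

```latex
% A center node is a 2-hop maxima iff its label is the largest of 5 i.i.d. labels.
\Pr[\text{center is a 2-hop maxima}] = \frac{1}{5},
\qquad
E[X] = \frac{n}{5} \cdot \frac{1}{5} = \frac{n}{25}

% Lower tail with epsilon = 1/2 gives the n/50 bound.
\Pr\left[X < \frac{n}{50}\right]
  \le \exp\left(-\frac{(1/2)^2 \cdot n/25}{2}\right)
  = \exp\left(-\frac{n}{200}\right)

% k = c log n consecutive centers all fail to be 2-hop maxima.
\left(\frac{4}{5}\right)^{c \log n} = n^{-c \log(5/4)} \le n^{-2}
\quad\text{whenever}\quad
c \ge \frac{2}{\log(5/4)} \approx 8.97

% Union bound over at most n starting positions, then the distance and round bounds.
n \cdot n^{-2} = \frac{1}{n},
\qquad
D \le 5c \log n
\;\Longrightarrow\;
\text{rounds} \le D + 2 = O(\log n)
```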

The argument given here establishing a linear lower bound on the number of 
communities can be easily extended to graphs with maximum degree bounded 
by a constant. The argument bounding the convergence time depended cru- 
cially on two properties of the underlying graph: (i) degrees being bounded and 
(ii) number of paths of length $O(\log n)$ being polynomial in number. Thus the 
convergence bound can be extended to other graph classes satisfying these two 
properties (e.g., trees with bounded degree). 

3 Analysis of Max-LPA on Clustered Erdos-Renyi 
Graphs 

We start this section by introducing a family of "clustered" random graphs that 
come equipped with a simple and natural notion of a community structure. We 
then show that on these graphs Max-LPA detects this natural community struc- 
ture in only 2 rounds, w.h.p. provided certain fairly general sparsity conditions 
are satisfied. 

3.1 Clustered Erdos-Renyi graphs 

Recall that for an integer $n \ge 1$ and $0 \le p \le 1$, the Erdos-Renyi graph $G(n, p)$ 
is the random graph obtained by starting with vertex set $V = \{1, 2, \ldots, n\}$ and 
connecting each pair of vertices $u, v \in V$ independently with probability $p$. Let $\Pi$ 
denote a partition $(V_1, V_2, \ldots, V_k)$ of $V$, let $\pi$ denote the real number sequence 
$(p_1, p_2, \ldots, p_k)$, where $0 \le p_i \le 1$ for all $i$, and let $0 \le p' < \min_i\{p_i\}$. The 
clustered Erdos-Renyi graph $G(\Pi, \pi, p')$ has vertex set $V$ and edges obtained 
by independently connecting each pair of vertices $u, v \in V$ with probability $p_i$ 
if $u, v \in V_i$ for some $i$ and with probability $p'$ otherwise (see Figure 1). Thus 
each induced subgraph $G[V_i]$ is the standard Erdos-Renyi graph $G(n_i, p_i)$, where 
$n_i = |V_i|$. 
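
The definition translates directly into a sampling procedure. Below is a minimal sketch, assuming nodes are numbered 0 to n-1 with the parts of the partition occupying consecutive ranges; the function name `clustered_erdos_renyi` and its parameters (`sizes`, `probs`, `p_inter`) are hypothetical names chosen for this illustration.

```python
import itertools
import random


def clustered_erdos_renyi(sizes, probs, p_inter, rng=None):
    """Sample a clustered Erdos-Renyi graph.

    sizes[i] is the size n_i of part V_i, probs[i] is p_i, and p_inter is p'.
    Returns (adj, cluster_of): adjacency sets and the part index of each node.
    """
    if not 0 <= p_inter < min(probs):
        raise ValueError("need 0 <= p' < min_i p_i")
    rng = rng or random.Random()
    # cluster_of[v] = index i of the part V_i that contains node v.
    cluster_of = [i for i, size in enumerate(sizes) for _ in range(size)]
    n = len(cluster_of)
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        same_part = cluster_of[u] == cluster_of[v]
        p = probs[cluster_of[u]] if same_part else p_inter
        # Each pair is connected independently with its own probability.
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj, cluster_of
```

With `probs` all equal to 1.0 and `p_inter` equal to 0.0 the result is a disjoint union of cliques, which is a convenient degenerate case for checking the construction. The quadratic loop over all pairs is fine for small experiments; for large sparse graphs a skip-based sampler would be preferable.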



Fig. 1: The clustered Erdos-Renyi graph, drawn as ellipses labeled $V_1, V_2, \ldots, V_k$. 
We connect two nodes in the $i$-th ellipse (i.e., $V_i$) with probability $p_i$ and nodes 
from different ellipses are connected with probability $p' < \min_i\{p_i\}$. 

Given that $p' < p_i$ for all $i$, one might view $G(\Pi, \pi, p')$ as having a natural 
community structure given by the vertex partition $\Pi$. Specifically, when $p'$ is 
much smaller than $\min_i\{p_i\}$, the inter-community edge density is much less than 
the intra-community edge density and it may be easier to detect the community 
structure $\Pi$. On the other hand as the intra-community probabilities get closer 
to $p'$, it may be hard for an algorithm such as Max-LPA to identify $\Pi$ as the 
community structure. Similarly, if an intra-community probability $p_i$ becomes 
very small, then the subgraph $G[V_i]$ can itself be quite sparse and, independent 
of how small $p'$ is relative to $p_i$, any community detection algorithm may end up 
viewing each $V_i$ as being composed of several communities. 

In the rest of the section, we explore values of the $p_i$'s and $p'$ for which 
Max-LPA "correctly" and quickly identifies $\Pi$ as the community structure of 
$G(\Pi, \pi, p')$. 

3.2 Analysis 

In the following theorem we establish fairly general conditions on the 
probabilities $\{p_i\}$ and $p'$ and on the node subset sizes $\{n_i\}$ and $n$ under which Max-LPA 
converges correctly, i.e., to the node partition $\Pi$, w.h.p. Furthermore, we show 
that under these circumstances just 2 rounds suffice for Max-LPA to reach 
convergence! 

Lemma 4. Let $G(\Pi, \pi, p')$ be a clustered Erdos-Renyi graph such that $p' < 
\min_i\{n_i/n\}$. Let $\ell_i$ be the maximum label of a node in $V_i$. Then for any node 
$v \in V_i$ the probability that $v$ is not adjacent to a node outside $V_i$ with label higher 
than $\ell_i$ is at least $1/(2e)$. 

Proof. Let $v'$ be a node in $V \setminus V_i$. Given that $|V_i| = n_i$ and $\ell_i$ is the maximum 
label among these $n_i$ nodes, the probability that the label assigned uniformly at 
random to $v'$ is larger than $\ell_i$ is $1/(n_i + 1)$. The probability that $v$ has an edge 
to $v'$ and $v'$ has a higher label than $\ell_i$ is $p'/(n_i + 1)$. Therefore the probability 
that $v$ has no edge to a node outside $V_i$ with label larger than $\ell_i$ is 

$\left(1 - \frac{p'}{n_i + 1}\right)^{n - n_i}.$ 

We bound this expression below as follows: 

$\left(1 - \frac{p'}{n_i + 1}\right)^{n - n_i} \ge \left(1 - \frac{p'}{n_i}\right)^{n} \ge \left(1 - \frac{1}{n}\right)^{n} \ge \frac{1}{2e}.$ 

□ 



Theorem 2. Let $G(\Pi, \pi, p')$ be a clustered Erdos-Renyi graph. Suppose that the 
probabilities $\{p_i\}$ and $p'$ and the node subset sizes $\{n_i\}$ and $n$ satisfy the 
inequalities: 

(i) $n_i p_i^2 > 8 n p'$ and (ii) $n_i p_i^4 > 1800 c \log n$, 

for some constant $c$. Then, given input $G(\Pi, \pi, p')$, Max-LPA converges 
correctly to node partition $\Pi$ in two rounds w.h.p. (Note that condition (ii) implies 
for each $i$, $p_i > c \log n / n_i$ and hence each $G[V_i]$ is connected.) 

Proof. Let $V_i = \{u_1, u_2, \ldots, u_{n_i}\}$ and without loss of generality assume that 
$\ell_{u_1} > \ell_{u_2} > \cdots > \ell_{u_{n_i}}$. Since all initial node labels are assumed to be distinct, 
in the first round of Max-LPA every node $u \in V$ acquires a label by breaking 
ties. Since ties are broken in favor of larger labels, all neighbors of $u_1$ in $V_i$ 
that have no neighbor outside $V_i$ with a label larger than $\ell_{u_1}$ will acquire the 
label $\ell_{u_1}$. Consider a node $v \in V_i$. Let $\beta$ denote the probability that $v$ has no 
neighbor outside $V_i$ with label larger than $\ell_{u_1}$. Note that inequality (i) in the 
theorem statement implies the hypothesis of Lemma 4 and therefore $\beta \ge 1/(2e)$. 
The probability that $v$ is a neighbor of $u_1$ and does not have a neighbor outside 
$V_i$ with label larger than $\ell_{u_1}$ is $\beta \cdot p_i$. Hence, after the first round of Max-LPA, 
in expectation, $n_i \cdot \beta \cdot p_i$ nodes in $V_i$ would have acquired the label $\ell_{u_1}$. In the 
rest of the proof we will use 

$X := n_i \cdot \beta \cdot p_i.$ 

Now consider node $u_j$ for $j > 1$. For a node $v \in V_i$ to acquire the label $\ell_{u_j}$ 
it must be the case that $v$ is adjacent to $u_j$, not adjacent to any node in 
$\{u_1, u_2, \ldots, u_{j-1}\}$, and not adjacent to any node outside $V_i$ with a label higher 
than $\ell_{u_j}$. Since $\ell_{u_j}$ is smaller than $\ell_{u_1}$, the probability that $v$ is not adjacent to 
a node outside $V_i$ with label higher than $\ell_{u_j}$ is less than $\beta$. Thus the probability 
that a node in $V_i$ acquires the label $\ell_{u_j}$ is at most $p_i (1 - p_i)^{j-1} \cdot \beta \le p_i (1 - p_i) \cdot \beta$. 
Furthermore, the probability that a node outside $V_i$ will acquire the label $\ell_{u_j}$ at 
the end of the first round is at most $p'$. Therefore, the expected number of nodes 
in $V$ that acquire the label $\ell_{u_j}$, at the end of the first round, is at 
most $n_i \cdot p_i (1 - p_i) \cdot \beta + (n - n_i) p'$. We now use inequality (i) and the fact that 
$2 \beta e \ge 1$ to upper bound this expression as follows: 

$n_i \cdot p_i (1 - p_i) \cdot \beta + (n - n_i) p' \le n_i \cdot p_i (1 - p_i) \cdot \beta + \frac{2 \beta e \cdot n_i p_i^2}{8} \le n_i \cdot p_i \left(1 - \frac{p_i}{4}\right) \cdot \beta.$ 

Therefore, the expected number of nodes in $V$ that acquire the label $\ell_{u_j}$, at the 
end of the first round, is at most $Y := n_i \cdot p_i (1 - p_i/4) \cdot \beta$. 

It is worth mentioning at this point that $X - Y = n_i p_i^2 \beta / 4$. 

Note that all the random variables we have utilized thus far, e.g., the number 
of nodes adjacent to $u_1$ and not adjacent to any node outside $V_i$ with label 
higher than $\ell_{u_1}$, can be expressed as sums of independent, identically distributed 
indicator random variables. Hence we can bound the deviation of such random 
variables using the tail bound in (4). In particular, let $Y'$ denote $Y + \sqrt{3cY \log n}$ 
and $X'$ denote $X - \sqrt{3cX \log n}$. With high probability, at the end of the first 
round of Max-LPA, the number of nodes in $V_i$ that acquire the label $\ell_{u_1}$ is at 
least $X'$ and the number of nodes in $V$ that acquire the label $\ell_{u_j}$, $j > 1$, is at 
most $Y'$. Next we bound the "gap" between $X'$ and $Y'$ as follows: 



$X' - Y' = X - Y - \sqrt{3cX \log n} - \sqrt{3cY \log n}$ 
$\ge \frac{n_i p_i^2 \beta}{4} - 2\sqrt{3cX \log n}$ 
$= \frac{n_i p_i^2 \beta}{4} - 2\sqrt{3c\, n_i p_i \beta \log n}$ 
$\ge \frac{n_i p_i^2 \beta}{4} - \frac{n_i p_i^2 \beta}{5}$ 
$= \frac{n_i p_i^2 \beta}{20}.$ 

The second inequality follows from $X - Y = n_i p_i^2 \beta / 4$ and $Y < X$, the third 
from the fact that $X = n_i p_i \beta$, and the fourth by using inequality (ii) from the 
theorem statement. 

We now condition the execution of the second round of Max-LPA on the 
occurrence of the two high probability events: (i) the number of nodes in $V_i$ that 
acquire the label $\ell_{u_1}$ is at least $X'$ and (ii) the number of nodes in $V$ that 
acquire the label $\ell_{u_j}$, $j > 1$, is at most $Y'$. Consider a node $v \in V_i$ just before 
the execution of the second round of Max-LPA. Node $v$ has in expectation 
at least $p_i X'$ neighbors labeled $\ell_{u_1}$ in $V_i$. Also, node $v$ has in expectation at 
most $p_i Y'$ neighbors labeled $\ell_{u_j}$, for each $j > 1$, in $V$. Let us now use $X''$ 
to denote the quantity $p_i X' - \sqrt{3c p_i X' \log n}$ and $Y''$ to denote the quantity 
$p_i Y' + \sqrt{3c p_i Y' \log n}$. By using (4) again, we know that w.h.p. $v$ has at least $X''$ 
neighbors with label $\ell_{u_1}$ and at most $Y''$ neighbors with a label $\ell_{u_j}$, $j > 1$. We 
will now show that $X'' > Y''$ and this will guarantee that in the second round 
of Max-LPA $v$ will acquire the label $\ell_{u_1}$, with high probability. Since $v$ is an 
arbitrary node in $V_i$, this implies that all nodes in $V_i$ will acquire the label $\ell_{u_1}$ 
in the second round of Max-LPA w.h.p. 

$X'' - Y'' = p_i (X' - Y') - \sqrt{3c p_i X' \log n} - \sqrt{3c p_i Y' \log n}$ 
$\ge \frac{n_i p_i^3 \beta}{20} - 2\sqrt{3c p_i X' \log n}$ 
$\ge \frac{n_i p_i^3 \beta}{20} - 2\sqrt{3c\, n_i p_i^2 \beta \log n}$ 
$> 0.$ 

The second inequality follows from the bound on $X' - Y'$ derived earlier and 
$Y' < X'$, the third from the fact that $X' < n_i p_i \beta$, and the fourth by using 
inequality (ii) from the theorem statement. 

Thus at the end of the second round of Max-LPA, w.h.p., every node in $V_i$ 
has label $\ell_{u_1}$. This is of course true, w.h.p., for all of the $V_i$'s. Now note that every 
node $v \in V_i$ has, in expectation, $n_i p_i$ neighbors in $V_i$ and fewer than $n p'$ neighbors 
outside $V_i$. Inequality (i) implies that $n p' < n_i p_i / 8$ and inequality (ii) implies 
that $n_i p_i = \Omega(\log n)$. Pick a constant $\epsilon > 0$ such that $n_i p_i (1 + \epsilon)/8 < n_i p_i (1 - \epsilon)$. 
By applying tail bounds (2) and (3), we see that w.h.p. $v$ has more than $n_i p_i (1 - \epsilon)$ 
neighbors in $V_i$ and fewer than $n_i p_i (1 + \epsilon)/8$ neighbors outside $V_i$. Hence, w.h.p. 
$v$ has no reason to change its label. Since $v$ is an arbitrary node in an arbitrary 
$V_i$, w.h.p. there are no further changes to the labels assigned by Max-LPA. □ 

To understand the implications of Theorem 2 consider the following example. 
Suppose that the clustered Erdos-Renyi graph has $O(1)$ clusters and each cluster 
has size $\Theta(n)$. In such a setting, inequality (ii) from the theorem simplifies 
to requiring that each $p_i = \Omega((\log n / n)^{1/4})$ and inequality (i) simplifies to 
$p' < p_i^2 / c$ for all $i$. This tells us, for instance, that Max-LPA converges in 
just two rounds on a clustered Erdos-Renyi graph in which each cluster has 
$\Theta(n)$ vertices and an intra-community probability of $\Omega(1/n^{1/4-\epsilon})$ and the 
inter-community probability is $O(1/n^{1/2})$. 
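
To get a feel for how demanding the two inequalities are, they can be evaluated numerically for concrete parameters. The sketch below assumes the conditions read (i) $n_i p_i^2 > 8 n p'$ and (ii) $n_i p_i^4 > 1800 c \log n$ with a natural logarithm; the function name `theorem2_conditions` and its parameters are hypothetical.

```python
import math


def theorem2_conditions(sizes, probs, p_inter, c=1.0):
    """Evaluate conditions (i) and (ii) for every cluster.

    sizes[i] is n_i, probs[i] is p_i, p_inter is p'. Returns a list of
    (cond_i, cond_ii) booleans, one pair per cluster.
    """
    n = sum(sizes)
    results = []
    for n_i, p_i in zip(sizes, probs):
        cond_i = n_i * p_i ** 2 > 8 * n * p_inter              # (i)
        cond_ii = n_i * p_i ** 4 > 1800 * c * math.log(n)      # (ii), natural log assumed
        results.append((cond_i, cond_ii))
    return results


if __name__ == "__main__":
    # Two clusters of 1000 nodes each: (ii) fails because of the constant 1800.
    print(theorem2_conditions([1000, 1000], [0.5, 0.5], 0.01))
    # Two clusters of a million nodes each: both conditions hold.
    print(theorem2_conditions([10**6, 10**6], [0.5, 0.5], 1e-3))
```

Under these assumptions the constant 1800 means that condition (ii) only becomes satisfiable for fairly large clusters, which is consistent with the asymptotic flavor of the theorem.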

This example raises several questions. If we were willing to allow more time for 
Max-LPA to converge, say $O(\log n)$ rounds, could we significantly weaken the 
requirements on the $p_i$'s and $p'$? Specifically, could we permit an intra-community 
probability $p_i$ to become as small as $c \log n / n$ for some constant $c > 1$? Similarly, 
could we permit the inter-community probability $p'$ to come much closer to the 
smallest $p_i$, say within a constant factor? 

We believe that it may be possible to obtain such results, but only via 
substantively different analysis techniques. 



4 Empirical Results on Sparse Erdos-Renyi Graphs 



In the previous section we proved that if the clusters (each $V_i$) in a clustered
Erdos-Renyi graph are dense enough and the inter-cluster edge density (the fraction
of edges between nodes in different $V_i$) is relatively low, then Max-LPA
converges correctly in just 2 rounds. Specifically, our result requires each
cluster to be an Erdos-Renyi random graph $G(n,p)$ with $p = \Omega((\log n/n)^{1/4})$. In
this section we ask: how does Max-LPA behave if individual clusters are much
sparser? For example, how does Max-LPA behave on $G(n,p)$ with much smaller
$p$, say $p = c \log n/n$ for some $c > 1$? The proof technique used in the previous section
does not extend to such small values of $p$. However, we believe that Max-LPA
converges quickly and correctly even on clustered Erdos-Renyi graphs whose
clusters are of the type $G(n,p)$ for $p = c \log n/n$ with $c > 1$. In this section, we ask
(and empirically answer) two questions:

1. Can one expect there to be a constant $c$ such that Max-LPA, when run
on $G(n,p)$ with $p > c \log n/n$, will, with high probability, terminate with one
community? If the answer is "yes", what might the running time
of Max-LPA be, as a function of $n$, for appropriate values of $p$?

2. Consider a clustered Erdos-Renyi graph with two parts $V_1$ and $V_2$ of equal
size, and each $p_i = c \log n/n$ for some $c > 1$. Let $p' = c'/n$ for some $c'$. Are
there constants $c$, $c'$ for which Max-LPA will quickly converge and correctly
identify $(V_1, V_2)$ as the community structure?

We are interested in values of $p$ of the form $c \log n/n$ because $\log n/n$ is the threshold
for connectivity in Erdos-Renyi graphs.

4.1 Simulation Setup 

We implemented Max-LPA in a C program and executed it on a Linux machine
(with a 2.4 GHz Intel(R) Core(TM)2 processor). For each execution we recorded the
number of rounds it took and the number of communities it declared at the end of the
execution. We executed Max-LPA on $G(n,p)$ and on $G(\Pi,\pi,p')$ with $\Pi =
(V_1, V_2)$, $|V_1| = |V_2| = n/2$, $\pi = (p,p)$, $p' = 0.6/n$ for various values of $n$ and $p$.
For each $n$, $p$ combination we ran Max-LPA 50 times. We used $p$ values of the
form $c \log n/n$ for various values of $c \geq 1$.
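
The setup above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C program. Max-LPA is not restated in this section, so the sketch assumes synchronous rounds, unique initial labels, closed neighborhoods (a node counts its own label), and ties broken toward the largest label. The function names `gnp` and `max_lpa`, the round cap, and the use of the natural logarithm are likewise assumptions made for the example.

```python
import math
import random
from collections import Counter


def gnp(n, p, rng):
    """Sample an Erdos-Renyi graph G(n, p) as adjacency lists."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def max_lpa(adj, max_rounds=1000):
    """Run synchronous label propagation.

    Returns (labels, rounds), where rounds is the number of rounds
    completed before a fixed point or a 2-cycle was detected.
    """
    n = len(adj)
    labels = list(range(n))  # assumption: unique initial labels
    previous = None          # labels from two rounds ago
    for rounds in range(max_rounds):
        new = []
        for u in range(n):
            counts = Counter(labels[v] for v in adj[u])
            counts[labels[u]] += 1  # assumption: closed neighborhood
            best = max(counts.values())
            # assumption: ties go to the largest label
            new.append(max(l for l, k in counts.items() if k == best))
        # stop on a fixed point, or when the labels repeat with period 2
        if new == labels or new == previous:
            return new, rounds
        previous, labels = labels, new
    return labels, max_rounds


if __name__ == "__main__":
    rng = random.Random(1)
    n, c = 200, 1.5
    adj = gnp(n, c * math.log(n) / n, rng)
    labels, rounds = max_lpa(adj)
    print("rounds:", rounds, "communities:", len(set(labels)))
```

The quadratic-time graph generator is only practical for small $n$; reproducing the larger graph sizes used in the experiments would need a faster sampler.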

4.2 Results 

We executed Max-LPA using the setup discussed above. Table 1 shows, for each
pair of $n$ and $c$ values, the number of simulations (out of 50) in which Max-LPA
ended with a single community. If the input graph is
disconnected then obviously there will be multiple communities. Therefore, we
also recorded the number of simulations in which the graph was connected; this
number is shown in brackets.
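
The text does not say how connectivity was checked. One simple possibility, shown here as an assumption, is a breadth-first search from an arbitrary node over adjacency lists; the function name `is_connected` is hypothetical.

```python
from collections import deque


def is_connected(adj):
    """Return True if the graph given as adjacency lists is connected."""
    n = len(adj)
    if n == 0:
        return True  # convention: the empty graph counts as connected
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    reached = 1
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                reached += 1
                queue.append(v)
    return reached == n
```

A run on a disconnected graph can then be counted separately, since no label propagation algorithm can merge components that share no edge.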



Table 1: Simulations on Erdos-Renyi graphs $G(n,p)$ where $p = c \log n/n$.
Each entry shows the number of simulations, out of 50 per $n$ and $c$ value, in
which Max-LPA declared a single community; the number of simulations in
which the graph $G(n,p)$ was connected is shown in brackets.

n         c = 1     c = 1.2   c = 1.5   c = 1.7
1000      44 (50)   47 (47)   50 (50)   50 (50)
2000      42 (46)   47 (50)   47 (50)   50 (50)
4000      45 (47)   49 (50)   50 (50)   50 (50)
8000      47 (48)   50 (50)   50 (50)   50 (50)
16000     49 (50)   50 (50)   50 (50)   50 (50)
32000     49 (50)   50 (50)   50 (50)   50 (50)
64000     50 (50)   50 (50)   50 (50)   50 (50)
128000    50 (50)   50 (50)   50 (50)   50 (50)



[Plot omitted: "Running Time of Max-LPA"; axis ticks 1 to 7; axis label begins "log".]

Fig. 2: Number of rounds for Max-LPA when executed on sparse Erdos-Renyi
graphs (averaged over the simulations, out of 50 per n and p, that ended with a
single community).



It is well known that $p = \log n/n$ is a threshold for connectivity in Erdos-Renyi
graphs, which explains the few runs for $c = 1$ in which the input graph
was disconnected. Table 1 indicates that Max-LPA, when executed on
Erdos-Renyi graphs with $p = c \log n/n$ and $c > 1$, terminates with one
community with high probability. It also seems that as $c$ increases, more
runs end with a single community. This is because as $c$ increases, the graph
becomes denser.

Figure 2 shows a plot of the number of rounds Max-LPA takes to converge
on $G(n,p)$ as $n$ increases, averaged over all simulations that resulted in a single
community at the end of the execution. The running time seems to grow
linearly with the logarithm of the graph size. Also, as $c$ increases the running time
decreases, which suggests that as the graph becomes denser Max-LPA converges
more quickly to a single community. Our results lead us to conjecture that
when Max-LPA is executed on Erdos-Renyi graphs G{n,p) with p ~ OQ^^^) 
it will, with high probability, terminate with a single community in O(logn) 
rounds. 
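
One way to examine the apparent linear growth in $\log n$ is an ordinary least-squares fit of the average round count against $\log_2 n$. The sketch below is an illustration only: the function name `fit_rounds_vs_log_n` and the choice of base 2 are assumptions, and the measured averages must be supplied by the reader, since the section reports them only in the figure.

```python
import math


def fit_rounds_vs_log_n(ns, mean_rounds):
    """Least-squares fit mean_rounds ~ slope * log2(n) + intercept.

    ns: graph sizes; mean_rounds: average round count for each size.
    Returns (slope, intercept).
    """
    xs = [math.log2(n) for n in ns]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(mean_rounds) / len(mean_rounds)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, mean_rounds))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar
```

Small residuals around the fitted line would be consistent with the conjectured $O(\log n)$ behavior, although a fit over finitely many sizes cannot establish it.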

Table 2 shows, for each pair of $n$ and $c$ values, the number of simulations (out of 50)
in which Max-LPA correctly identified the partition $\Pi$ when executed
on $G(\Pi,\pi,p')$ with $p' = 0.6/n$. By the results in Table 1, for $c = 1.5$ Max-LPA
declared a single community w.h.p. when executed on $G(n,p)$. Therefore,
in these experiments we started with $c = 1.5$. For $c = 1.5$, however, the influence of
nodes in the other part is significant. As $c$ increases, this influence becomes
insignificant compared to the influence of nodes within the same part.
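
The two-part experiment can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `clustered_gnp` and `identifies_partition` are hypothetical names, parts are laid out as consecutive node ranges, and a run is counted as correct only when two nodes share a final label exactly when they lie in the same part.

```python
import random


def clustered_gnp(sizes, probs, p_out, rng):
    """Sample a clustered Erdos-Renyi graph.

    sizes[i] and probs[i] are the size and intra-part edge probability
    of part i; p_out is the inter-part edge probability.
    Returns (adj, part_of) with adjacency lists and each node's part.
    """
    part_of = [i for i, size in enumerate(sizes) for _ in range(size)]
    n = len(part_of)
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            same = part_of[u] == part_of[v]
            p = probs[part_of[u]] if same else p_out
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj, part_of


def identifies_partition(labels, part_of):
    """True if nodes share a label exactly when they share a part."""
    label_to_part = {}
    part_to_label = {}
    for label, part in zip(labels, part_of):
        if label_to_part.setdefault(label, part) != part:
            return False  # one label spans two parts
        if part_to_label.setdefault(part, label) != label:
            return False  # one part is split across labels
    return True
```

For the setting of Table 2 one would call `clustered_gnp([n // 2, n // 2], [p, p], 0.6 / n, rng)` and pass the final labels of a Max-LPA run on the resulting graph to `identifies_partition`.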



Table 2: Simulations of Max-LPA on $G(\Pi,\pi,p')$ with $\Pi = (V_1, V_2)$,
$|V_1| = |V_2| = n/2$, $\pi = (p,p)$, where $p = c \log n/n$ and $p' = 0.6/n$. Each entry
shows, for particular $n$ and $c$ values, the number of simulations out
of 50 in which Max-LPA identified two communities $V_1$ and $V_2$. The number of
simulations in which the graph was connected is shown in brackets.

n         c = 1.5   c = 2     c = 4
1000      22 (45)   39 (50)   50 (50)
2000      21 (39)   40 (50)   50 (50)
4000      22 (36)   47 (50)   50 (50)
8000      14 (38)   47 (50)   50 (50)
16000     26 (35)   49 (49)   50 (50)
32000     17 (33)   49 (49)   50 (50)
64000     26 (34)   46 (50)   50 (50)
128000    5 (35)    47 (47)   50 (50)



5 Future Work 

We believe that with some refinements, the analysis technique used to show
$O(\log n)$-round convergence of Max-LPA on paths can be used to show
polylogarithmic convergence on sparse graphs in general, e.g., those with degree
bounded by a constant. This is one direction we would like to take our work in.



At this point the techniques used in Section 3 do not seem applicable to
sparser clustered Erdos-Renyi graphs. But if we were willing to allow more time
for Max-LPA to converge, say $O(\log n)$ rounds, could we significantly weaken
the requirements on the $p_i$'s and $p'$? Specifically, could we permit an
intra-community probability $p_i$ to become as small as $c \log n/n$ for some constant
$c > 1$? Similarly, could we permit the inter-community probability $p'$ to come
much closer to the smallest $p_i$, say within a constant factor? This is another
direction for our research.